convex combination between the full-precision $f_n$ and the quantized $\hat{f}_n$ as follows:
\[
\tilde{f}_n = \lambda f_n + (1 - \lambda)\,\hat{f}_n. \tag{5.11}
\]
The hyperparameter $\lambda$ controls the strength of teacher forcing. $\lambda = 1$ gives full correction of the reconstruction error but introduces forward inconsistency, i.e., the connection between the current module and the previous quantized modules is broken. Conversely, $\lambda = 0$ removes the forward inconsistency but suffers from the propagated reconstruction error. To achieve a good trade-off between reconstruction error reduction and forward inconsistency elimination, a linear decay strategy for $\lambda$ is proposed:
\[
\lambda_t = \max\!\left(1 - \frac{t}{T_0},\ 0\right), \tag{5.12}
\]
where $T_0$ is the preset maximum number of decay steps. In the beginning, a large $\lambda$ is desired since each module is barely optimized. Later, a small $\lambda$ is preferred to transition to normal training so that the forward inconsistency can be bridged. The remaining $T - T_0$ steps stick to normal training so that each quantized module adapts to its own predecessors.
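To make the annealed teacher forcing concrete, the PyTorch-style sketch below illustrates one possible implementation of Eqs. (5.11) and (5.12): a linear_decay helper computes $\lambda_t$, and each quantized module reads a convex combination of the full-precision and quantized outputs of its predecessor before the module-wise reconstruction error is accumulated. The function names, the MSE reconstruction loss, and the frozen full-precision branch are illustrative assumptions, not the exact implementation used in the experiments.

```python
import torch
import torch.nn.functional as F


def linear_decay(step: int, T0: int) -> float:
    """Eq. (5.12): lambda_t = max(1 - t / T0, 0)."""
    return max(1.0 - step / T0, 0.0)


def teacher_forced_mrem_loss(fp_modules, quant_modules, hidden, step, T0):
    """Illustrative sketch of one forward pass with annealed teacher forcing.

    fp_modules    -- frozen full-precision Transformer modules f_1, ..., f_N
    quant_modules -- the corresponding quantized modules being tuned
    hidden        -- input hidden states fed to the first module pair
    """
    lam = linear_decay(step, T0)
    total_loss = hidden.new_zeros(())        # scalar loss accumulator
    f_prev, f_hat_prev = hidden, hidden      # both branches start from the same input
    for fp_mod, q_mod in zip(fp_modules, quant_modules):
        with torch.no_grad():                # full-precision branch only provides targets
            f_cur = fp_mod(f_prev)
        # Eq. (5.11): the quantized module reads a convex combination of the
        # full-precision and quantized outputs of its predecessor.
        f_tilde = lam * f_prev + (1.0 - lam) * f_hat_prev
        f_hat_cur = q_mod(f_tilde)
        total_loss = total_loss + F.mse_loss(f_hat_cur, f_cur)   # reconstruction error
        f_prev, f_hat_prev = f_cur, f_hat_cur
    return total_loss, lam
```

Called once per training step $t$, the mixing weight reaches zero at $t = T_0$, after which the remaining $T - T_0$ steps proceed as normal training on the quantized chain, matching the schedule described above.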
The comparison between the proposed method and other existing state-of-the-art BERT quantization methods is presented in Table 5.4. As shown there, both the proposed MREM-S and MREM-P outperform existing PTQ approaches in most cases, and even achieve results close to those of QAT approaches. For example, the “W4-E4-A8” quantized MREM-S and MREM-P reach 83.5% and 83.4% accuracy on MNLI-m, respectively, which is on par with the “W2/4-E8-A8” quantized Q-BERT. For the “W2-E2-A8” quantized models, MREM-S and MREM-P surpass GOBO by 11.7% and 11.3% on MNLI-m, respectively.
In summary, this paper’s contributions are as follows: (1) module-wise reconstruction error minimization (MREM), a fast, memory-saving, and data-efficient approach to improving post-training quantization for language models; (2) a new model-parallel strategy based on MREM that accelerates post-training quantization with a theoretical speed-up for distributed training; and (3) annealed teacher forcing to alleviate the propagation of reconstruction error and boost performance.
TABLE 5.4
Results on the GLUE development set. “MREM-S” denotes sequential optimization.
Quantization   #Bits (W-E-A)   Size   PTQ   MNLI-m   QQP    QNLI   SST-2   CoLA   STS-B   MRPC   RTE    Avg.
-              full-prec.      418    -     84.9     91.4   92.1   93.2    59.7   90.1    86.3   72.2   83.9
Q-BERT         2-8-8           43     -     76.6     -      -      84.6    -      -       -      -      -
Q-BERT         2/4-8-8         53     -     83.5     -      -      92.6    -      -       -      -      -
Quant-Noise    PQ              38     -     83.6     -      -      -       -      -       -      -      -
TernaryBERT    2-2-8           28     -     83.3     90.1   91.1   92.8    55.7   87.9    87.5   72.9   82.7
GOBO           3-4-32          43     ✓     83.7     -      -      -       -      88.3    -      -      -
GOBO           2-2-32          28     ✓     71.0     -      -      -       -      82.7    -      -      -
MREM-S         4-4-8           50     ✓     83.5     90.2   91.2   91.4    55.1   89.1    84.8   71.8   82.4
MREM-S         2-2-8           28     ✓     82.7     89.6   90.3   91.2    52.3   88.7    86.0   71.1   81.5
MREM-P         4-4-8           50     ✓     83.4     90.2   91.0   91.5    54.7   89.1    86.3   71.1   82.2
MREM-P         2-2-8           28     ✓     82.3     89.4   90.3   91.3    52.9   88.3    85.8   72.9   81.6
Note: “MREM-P” denotes parallel optimization. “Size” refers to model storage in “MB”. “PTQ” indicates whether the method belongs to post-training quantization. “Avg.” denotes the average results of all tasks.